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Abstract. — Bayesian inference provides an appealing general framework for phylogenetic analysis, able to incorporate 
a wide variety of modeling assumptions and to provide a coherent treatment of uncertainty. Existing computational 
approaches to Bayesian inference based on Markov chain Monte Carlo (MCMC) have not, however, kept pace with the 
scale of the data analysis problems in phylogenetics, and this has hindered the adoption of Bayesian methods. In this 
paper, we present an alternative to MCMC based on Sequential Monte Carlo (SMC). We develop an extension of classical 
SMC based on partially ordered sets and show how to apply this framework — which we refer to as PosetSMC — to phyloge- 
netic analysis. We provide a theoretical treatment of PosetSMC and also present experimental evaluation of PosetSMC on 
both synthetic and real data. The empirical results demonstrate that PosetSMC is a very promising alternative to MCMC, 
providing up to two orders of magnitude faster convergence. We discuss other factors favorable to the adoption of 
PosetSMC in phylogenetics, including its ability to estimate marginal likelihoods, its ready implementability on parallel 
and distributed computing platforms, and the possibility of combining with MCMC in hybrid MCMC-SMC schemes. Soft- 
ware for PosetSMC is available at http://www.stat.ubc.ca/ bouchard/ PosetSMC. [Bayesian inference; sequential Monte 
Carlo.] 



Although Bayesian approaches to phylogenetic infer- 
ence have many conceptual advantages, there are seri- 
ous computational challenges needing to be addressed 
if Bayesian methods are to be more widely deployed. 
These challenges are currently being shaped by two 
trends: (i) Advances in computer systems make it possi- 
ble to perform iterations of Markov chain Monte Carlo 
(MCMC) algorithms more quickly than before and (ii) 
however, thanks to advances in sequencing technology, 
the amount of data available in typical phylogenetic 
studies is increasing rapidly. The latter rate currently 
dominates the former, and, as a consequence, there 
are increasing numbers of phylogenetic data sets that 
are out of the reach of Bayesian inference. Moreover, 
while future improvement in computational perfor- 
mance is expected to come in the form of paralleliza- 
tion, current methods for parallelizing MCMC (Feng 
et al. 2003; Altekar et al. 2004; Keane et al. 2005; Suchard 
and Rambaut 2009) have important limitations (see the 
Discussion section). 

Another issue with MCMC algorithms is the diffi- 
culty of computing the marginal likelihood, a quantity 
needed in Bayesian testing (Robert 2001). The naive es- 
timator has unbounded variance (Newton and Raftery 
1994), and alternatives, such as thermodynamic inte- 
gration (Lartillot and Philippe 2006), are nontrivial to 
implement in the phylogenetic setting (Gelman and 
Meng 1998; Fan et al. 2011; Xie et al. 2011). 

In this paper, we propose an alternative methodol- 
ogy for Bayesian inference in phylogeny that is based 
on Sequential Monte Carlo (SMC). Although SMC is 
generally applied to problems in sequential data analy- 
sis (Doucet et al. 2001), we develop a generalized form 
of SMC — which we refer to as PosetSMC — that applies 
directly to phylogenetic data analysis. We present exper- 



iments that show that for a large range of phylogenetic 
models and reconstruction precision levels, PosetSMC 
is significantly faster than MCMC. Moreover, PosetSMC 
automatically provides a well-behaved estimate of the 
marginal likelihood. 

There have been earlier attempts to apply classical 
SMC to inference in tree-structured models (Gorur and 
Teh 2008; Teh et al. 2008), but this work was restricted 
to models based on the coalescent. There has also been 
a large body of work applying sequential methods that 
are closely related to SMC (e.g., importance sampling 
and approximate Bayesian computation) to problems 
in population genetics, but this work also focuses on 
coalescent models (Griffiths and Tavare 1996; Beaumont 
et al. 2002; Marjoram et al. 2002; Iorio and Griffiths 
2004; Paul et al. 2011). Although the coalescent model 
is dominant in population genetics, a wider range 
of prior families is needed in phylogenetic inference. 
Examples include the Yule process (Rannala and Yang 
1996), uniform priors, and a variety of nonclock 
priors (Huelsenbeck and Ronquist 2001). 

The general problem of Bayesian computation in 
phylogeny involves approximating an integral over 
a space T of phylogenetic trees. These integrals are 
needed for nearly all Bayesian inferences. To motivate 
the PosetSMC framework for computing these inte- 
grals, it is useful to consider the simpler problem of 
maximizing over a space of phylogenetic trees. When 
maximizing over phylogenies, two broad strategies, 
or metaalgorithms, are used: local search and sequential 
search. The two strategies use radically different meth- 
ods to compute optimal trees, and they use a different 
type of state representation. 

In local search, the representation maintained at each 
step of the algorithm is based on a full specification 
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of a phylogenetic tree (which we will call a full state 
or just a state for short). Local search starts at an arbi- 
trary state, and at each iteration, states that are nearby 
the current candidate are evaluated (where "nearby" is 
defined according to a user-specified metric, e.g., the 
metric induced by local branch exchanges or by regraft- 
ing; Felsenstein 2003). The current candidate is updated 
to one of these nearby states until a stopping criterion is 
met. Stochastic annealing is an example of a local search 
strategy. 

In sequential search, the representation maintained 
at each step of the algorithm is based on a partial spec- 
ification of a phylogenetic tree. A partial specification 
can take the form of a phylogenetic forest over the 
observed taxa, for example, where a phylogenetic for- 
est is defined as a set of phylogenetic trees. We call 
these entities partially specified states or partial states. 
Note that there is a dual representation of partial states: 
A partial state s can be viewed as a set of full states, 
D(s) C T: the set of full states that satisfy a given 
constraint. In the previous example where each partial 
state s is a forest, the dual set D(s) corresponds to the 
set of all phylogenetic trees t that contain as subtrees 
all the trees i' of the forest s, with matching branch 
lengths. 

Sequential search starts at the least partial state (the 
partial state whose dual is the set of all full states; i.e., the 
partial state where each tree in the forest consists of ex- 
actly one taxon) and proceeds iteratively as follows: At 
each step, a set of possible successors of the current par- 
tial state is considered. The successor of a state s refers to 
a new partial state s' with a dual D(s') strictly contained 
in the dual D(s) of the current partial state s. Second, a 
promising successor, or set of successors, is taken as the 
new candidate partial state(s). Neighbor joining is an ex- 
ample of a sequential search strategy, where successors 
are obtained by merging two trees in a forest, forming a 
new forest with one less tree in it. 

If an infinite computational budget were available, 
local search strategies would generally be preferred 
over sequential ones. For example, stochastic anneal- 
ing is guaranteed under conditions on the annealing 
rate to approach the optimal solution, whereas neigh- 
bor joining will maintain a fixed error. However, since 
computational time is a critical issue in practice, cheap 
algorithms such as neighbor joining are often preferred 
to more expensive alternatives. 

We now return to the Bayesian problem of integration 
over the space of trees. Note that MCMC algorithms for 
integration can be viewed as analogs of the local search 
strategies used for maximization. What would then be 
a sequential strategy for integration? This is exactly 
where SMC algorithms fit. SMC uses partial states that 
are successively extended until a fully specified state 
is reached. Many candidates are maintained simulta- 
neously (these candidate partial states are called "par- 
ticles"), and unpromising candidates are periodically 
pruned (typically by resampling). Given unbounded 
computational resources, MCMC eventually outper- 
forms SMC, but for smaller computation times, or for 



highly parallelized architectures, SMC is an attractive 
alternative. 

SMC algorithms were originally developed in the con- 
text of a restrictive class of models known as state-space 
models. While there has been work on extending SMC 
to more general setups (Moral et al. 2006; Andrieu et al. 
2010), this earlier work has been based on the assump- 
tion that the target integral is approximated by integrals 
over product spaces of increasing dimensionality, an 
assumption incompatible with the combinatorial aspect 
of phylogenetic trees. 

The work of Tom et al. (2010) applies sequential sam- 
pling and reweighting methods to phylogenetics, but in 
the context of pooling results from stratified analyses. In 
this paper, we construct a single joint tree posterior. 

The remainder of the article is organized as follows. In 
the Background and Notation section, we review some 
basic mathematical definitions and notation. We then 
outline the general PosetSMC framework and present 
theoretical guarantees for PosetSMC in the Theoreti- 
cal Guarantees section. Results on synthetic and real 
data are presented in the Experiments section, and we 
present our conclusions in the Discussion section. 



Background and Notation 

Before describing PosetSMC, we introduce our nota- 
tion, which is based on the notation of Semple and Steel 
(2003). Let X be a set of leaves (observed taxa), T be a 
random phylogenetic X tree, with values in a measur- 
able space T (we will consider both ultrametric setups 
and nonclock setups), and y, a set of observations at the 
leaves. For X' c X, we use the notation y(X') for the 
subset of observations corresponding to a subset X' of 
the leaves. In rooted trees, we use the terminology rooted 
clade to describe a maximal subset of leaves X'cX that 
are descendant from an internal node. For a rooted tree 
t, we denote the set of such subsets by L T (t). In the un- 
rooted tree case, the set £ u (f) is the set of blocks in the 
bipartitions obtained by removing an edge in the tree; 
that is, Zu(f ) = L r (f) U {X\X' : X' e I r (f')}, where f is an 
arbitrary rooting of t. 

We assume that the trees t C T contain positive 
branch lengths, encoded as maps from subsets of leaves, 
2 X , to the nonnegative real numbers. In the rooted case, 
b r (X' ; t) is equal to zero if X' ^ Z r (t) and equal to the 
length of the edge joining the root of the clade X' to its 
parent otherwise. In the nonclock case, b u (X' ; t) is equal 
to zero if X' ^ £ u (£) and to the edge corresponding to 
the bipartition X'|X\X' otherwise. 

We denote the unnormalized target posterior measure 
given y by rcy or n for short. (Formally, TCy :!Fq — > [0, oo), 
where we use Tf to denote the Borel sigma algebra on 
T.) The measure tc is the product of a prior and likeli- 
hood evaluated at the observations, summing over the 
states at the internal nodes. We assume that it has an 
unnormalized density yy : T — > [0, oo). In most mod- 
els, y — yy can be evaluated at any point in polyno- 
mial time (Felsenstein 2003), whereas the normalization 
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||7t|| = 7t(T), which is equal to the marginal likelihood 
V(y) by Bayes' theorem, is hard to compute — estimating 
it is the first of the two goals of the method of the next 
section. 

The second goal was to compute posterior expecta- 
tions of the form E[c|>(T)|3>] for a problem-dependent 
test function cf) : T — ► K c for some c e N. These functions 
are generally the sufficient statistics needed to compute 
Bayes estimators. To define Bayes estimators and to 
evaluate their reconstructions, we will make use of the 
partition metric (which ignores branch lengths), dpM/ and 
the LI and (squared) L2 metrics, dn,di2, which take 
branch lengths into account (Bourque 1978; Robinson 
and Foulds 1981; Kuhner and Felsenstein 1994). We will 
use the unrooted versions of these metrics (which al- 
lows us to measure distance between rooted trees as 
well by ignoring the rooting information): 

d PM (f,f')H£u(f)AIu(f')l, 
d u (t,t') = K(X'}t)-b n (X';t% 

X'CX 

<M*.f)= ^(k u (X';f)-MX';f)) 2 , 

X'CX 

where AAB denotes the symmetric difference of sets A 
and B. Note that a loss function can be derived from 
each of these metrics by taking the additive inverse. For 
example, if the objective is to reconstruct a consensus 
topology with the least partition metric risk, each coor- 
dinate of the function, 4>„ takes the form of an indica- 
tor function on a unrooted clade X, c X, 4>,(f) = 1[X, C 

We will use the notation n = 7t/||7t|| for normalized 
measures and 7t((J)) as a shorthand for integration of 4> 
with respect to ft, for example, E [cf)(T)|[V] here. 

We will also need some concepts from order theory. 
A poset (partially ordered set) (S, -<) is a generalization 
of the familiar order relation <. Poset relations -< sat- 
isfy many of the properties found in < (reflexivity anti- 
symmetry, and transitivity), but they do not require all 
pairs of elements of S to be comparable (i.e., there can 
be s, s' e S with neither s -< s' nor s' -< s). We say that s' 
covers s if s -< s', and there are no s" other than s, s' with 
s -< s" -< s'. Finally, we will refer in the following to the 
(undirected) Hasse diagram of a poset, which is an undi- 
rected graph where the set of vertices is S, and there is 
an edge between s and s' whenever s covers s'. 

Poset SMC on Phylogenetic Trees 

We now turn to the description of the PosetSMC 
framework. This framework encompasses existing work 
on tree-based SMC (Teh et al. 2008) as a special case and 
yields many new algorithms. 

Overview 

PosetSMC is a flexible algorithmic framework, with 
the flexibility deriving from two sources. The first is the 



choice of proposal: Just as in MCMC methods, PosetSMC 
requires the specification of a proposal distribution and 
there is freedom in this choice. Second, PosetSMC re- 
quires an extension of the density y, which is defined on 
trees, to a density y* that is defined on forests. 

Figure 1 presents an overview of the overall PosetSMC 
algorithmic framework. It will be useful to refer to this 
figure as we proceed through the formal specification of 
the framework. 

We begin by discussing proposal distributions. Note 
that MCMC algorithms also involve proposal distribu- 
tions, but, in contrast to MCMC, where the proposals 
are transitions from T to T, the proposals in PosetSMC 
are defined over a larger space, S D T, which we will be 
endowing with a partial order structure. The elements 
of this larger space are called partial states, and they 
have the same dual interpretation as the partial states 
described in the context of maximization algorithms in 
the introduction. We denote the dual of s by D(s), where 
D : S — » Tf. We write "V s for the proposal distribution (a 
regular conditional probability) at s, with density ^(s — > 
s'), s,s' € S, and we let q n denote the n-step proposal 
density. Although an MCMC proposal is generally de- 
fined using a metric (i.e., sampling is done within a sub- 
set of states that are nearby the current state), we need 
the richer structure of posets to define proposals in our 
case. 

In particular, we assume that the proposal distribu- 
tions are such that they allow a poset representation. 

Assumption 1 We assume that there is a poset (S, -<) 
such that q n (s — > s') > 0 for some n ^ 1 if and only if 

s -< s'. 

The associated poset representation encodes whether 
states are reachable via applications of the proposal 
distribution. Note that this puts restrictions on valid 
proposal distributions: In particular — and in contrast 
to MCMC proposals — directed cycles should have zero 
density under q. 

Using the proposal distribution, our algorithms it- 
eratively propose successor states s' with s -< s', until 
the algorithm arrives at partial states that fully specify 
a phylogenetic tree, where \D(s)\ = 1. In order to keep 
track of the number of proposal applications needed 
before arriving at a fully specified state, we assume that 
the posets are equipped with a rank, a monotone func- 
tion p : S — > {0, . . . , R} such that whenever s covers s', 
p(s) = p(s') + 1. At each proposal step, the rank is in- 
creased by one, and the set of states of highest rank R is 
assumed to coincide with the set of fully specified states: 
p -1 (R) = T. At the other extremity of the poset, we let 
_L denote the unique minimum element with D(s) = T. 

In addition to a proposal distribution, a second object 
needs to be provided to specify a PosetSMC algorithm: 
an extension y„ : S — > [0, oo) of the density y from T to S. 

In the remainder of the paper, we provide examples 
of proposals and extensions, and we also provide a pre- 
cise set of conditions that are sufficient for correctness of 
a PosetSMC algorithm. Before turning to those results, 
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FIGURE 1 . An overview of the PosetSMC algorithmic framework. A PosetSMC algorithm maintains a set of partial states (three partial states 
are shown in the leftmost column in the figure; each partial state is a forest over the leaves A, B, C, and D). Associated with each partial state is a 
positive-valued weight. The algorithm iterates the following three steps: (i) resample from the weighted partial states to obtain an unweighted 
set of partial states, (ii) propose an extension of each partial state to a new partial state in which two trees in the forest have been connected, and 
(iii) calculate the weights associated with the new partial states. 



we give a simple concrete example of a proposal distri- 
bution in the case of ultrametric trees. 

In an ultrametric setup, the examples we consider are 
based on S defined as the set of ultrametric forests over 
X. An ultrametric forest s = {(f,, X,)} is a set of ultramet- 
ric X,-trees f, such that the disjoint union of the leaves 
yields the set of observed taxa: UX, =X. The rank of such 
a forest is defined as |X| — |s|. 

Defining the height of an ultrametric forest as the 
height of the tallest tree in the forest, we can now intro- 
duce the partial order relationship we use for ultramet- 
ric setups. Let s and s' be ultrametric forests. We deem 
that s -< s' if all the trees in s appear as subtrees in s' with 
matching branch lengths and if the height of s' is strictly 
greater than the height of s. As we will see shortly, any 
proposal that simply merges a pair of trees and strictly 
increases the forest height is a valid proposal. 



Algorithm Description 

Once these two ingredients are specified — a proposal 
and an extension — the algorithm proceeds as follows. At 
each iteration r, we assume that a list of K partial states 
is maintained (each element of this list is called a parti- 
cle). These particles are denoted by s r ,i, . . . , s,.^ e S. We 



also assume that there is a positive weight iu r j ( associated 
with each particle s r £. Combined together, these form an 
empirical measure: 

K 

n r,K(-)=^2 w r,^s r , k {-), (1) 
k=l 

where S X (A) = 1 if x € A and 0 otherwise. 

Initially, Sq^ = _L and zvq^ = 1/K for all k. Given the list 
of particles and weights from the previous iteration r—1, 
a new list is created in three steps. The first step can be 
understood as a method for pruning unpromising par- 
ticles. This is done by sampling independently K times 
from the normalized empirical distribution 7t,— The 
result of this step is that some of the particles (mostly 
those of low weight) will be pruned. (Other sampling 
schemes, such as stratified sampling and dynamic on- 
demand resampling, can be used to further improve 
performance; see Doucet et al. 2001.) We denote the 
sampled particles by S r -i t i, ■ ■ ■ , s,-!^- The second step 
is to create new particles, s, % i, . . . , s r ^, by extending the 
partial states of each of the sampled particles from the 
previous iteration. This is done by sampling K times 
from the proposal distribution, s r ^ ~ "Vj^._j t . The third 
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step is to compute weights for the new particles: 



y*{ s r-l,k)q[Sr-l,k -> s r,k) 

Finally, the target distribution is approximated by 
t^r,k, so that the target conditional expectation 7r((j)) is 
approximated by ft^(<p). As for the marginal likeli- 
hood, the estimate is given by the product over ranks of 
the weight normalizations: 

IMI«IIjFlK*ll- ^ 

r=\ 

It is worth highlighting some of the similarities and 
differences between PosetSMC and other sampling- 
based algorithms in phylogenetics, in particular the 
nonparametric bootstrap and MCMC algorithms. First, 
as in the case of the bootstrap, the K particles in Poset- 
SMC are sampled with replacement, and the number of 
particles (which remains constant throughout the run 
of the algorithm) is a parameter of the algorithm (see 
the Discussion section for suggestions for choosing the 
value of K). Second, new partial states obtained from 
the proposal distribution are always "accepted" (as op- 
posed to Metropolis-Hastings, where some proposals 
are rejected). Third, the weights of newly proposed 
states influence the chance each particle survives into 
the next iteration. Fourth, once full states have been 
created by PosetSMC, the algorithm terminates. Finally, 
PosetSMC is readily parallelized, simply by distributing 
particles across multiple processors. MCMC phyloge- 
netic samplers can also be parallelized, but the paral- 
lelization is less direct (see the Discussion section for 
further discussion of this issue). 

Theoretical Guarantees 

In this section, we give theoretical conditions for sta- 
tistical correctness of PosetSMC algorithms. More pre- 
cisely, we provide sufficient conditions for consistency 
of the marginal likelihood estimate and of the target 
expectation as the number of particles K goes to infinity. 

The sufficient conditions are as follows. (Note that all 
of them have an intuitive interpretation and are easy 
to check.) The first group of conditions concern the 
proposal: 

ASSUMPTION 2 Let q : S -> S be a proposal density, 
with associated poset (<S, -<) as defined in the previous 
section. We assume that the (undirected) Hasse diagram 
corresponding to (S, -<) is (a) connected and (b) acyclic. 

Assumption 2(a) can be compared with an irre- 
ducibility condition in MCMC theory: There must be 
a path of positive proposal density reaching each state. 
Assumption 2(b) is more subtle but is very important 
in our framework. It can be restated as saying that for 
each partial state s, there should be at most one se- 
quence of partial states connecting it to the initial state 



_L = so, . . . ,s„ = s, with q(sj — > s,-+i) > 0 for all i. This 
insures that trees are not overcounted. 

Next, we introduce the conditions on the extension y „ : 

ASSUMPTION 3 The density y* is (a) positive, y*(s) > 0 
for all s G S, and (b) it extends y, in the sense that there 
is a C > 0 with y = ItCy* ■ 

Note that we do not require that it be feasible to 
compute C — its value is not needed in our algorithms. 
Assumption 3(a) insures that the ratio in Equation (2) is 
well defined, and Assumption 3(b) is the step where the 
form of the target density y is taken into account. 

Under these assumptions, and assuming regularity 
conditions on (J) and y«, we have that PosetSMC is con- 
sistent (see Appendix 1 for the proof): 

PROPOSITION 4 If Assumptions 1, 2, and 3 are met, and 
y* is bounded, then as K — > oo, 

KkII *-=^imi. ( 4 ) 

r=l 

and moreover, when 7t(|4>|) < oo, 

ttr,.k(4>) ^ 7t(4>)- (5) 

Examples 

In this section, we provide several examples of 
proposal distributions and extensions that meet the 
conditions described in the previous section. 

Proposals 

In the ultrametric case, we have given in the Overview 
section a recipe for creating valid proposal distributions. 
In particular, we can use proposals that merge pairs 
while strictly increasing the height of the forest. We be- 
gin this section by giving a more detailed explanation of 
why the strict increase in height is important and how it 
solves an overcounting issue. 

To understand this issue, let us consider a coun- 
terexample with a naive proposal that does not satisfy 
Assumption 2(b) and show that it leads to a biased esti- 
mate of the posterior distribution. For simplicity, let us 
consider ultrametric phylogenetic trees with a uniform 
prior with unit support on the time between speciation 
events. 

There are (2n — 3)!! = lx3x5 = 15 rooted topologies 
on four taxa, and in Figure 2, we show schematically a 
subset of the Hasse diagram of the particle paths that 
lead to these trees under the naive proposal described 
above. It is clear from the figure that a balanced binary 
tree on four taxa, with rooted clades {A, B}, {C, D}, can 
be proposed in two different ways: either by first merg- 
ing A and B, then C and D, or by merging C and D, then 
A and B. On the other hand, an unbalanced tree on the 
same set of taxa can be proposed in a single way, for 
example, {A,B} — » {A,B,C}. As a consequence, if we 
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FIGURE 2. To illustrate how PosetSMC sequentially samples from the space of trees, we present a subset of the Hasse diagram induced 
by the naive proposal described in the Examples section. Note that this diagram is not a phylogenetic tree: The circles correspond to partial 
states (phylogenetic forests), organized in rows ordered by their rank p, and edges denote a positive transition density between pairs of partial 
states. The forests are labeled by the union of the sets of nontrivial rooted clades over the trees in the forest. The dashed lines correspond to 
the proposal moves forbidden by the strict height increase condition (Assumption 2(b) in the text). Note that we show only a subset of the 
Hasse graph since the branch lengths make the graph infinite. The subset shown here is based on an intersection of height function fibers: Given 
a subset of the leaves X' C X, we define the height function hx* (s) as the height of the most recent common ancestor of X' in s, if X' is a clade 
in one of the trees in s, and u> otherwise, where w £ E. Given a map / : 2 X — » [0, oo), the subset of the vertices of the Hasse diagram shown 
is given by Dx/cx/i^^X'), cu}). The graph shown here corresponds to any/ such that /({A, B}) < f({C,D}),f({A,C}) < /({B,D}),and 
f({A,D})<f({B,C}). 



assume there are no observations at the leaves, the ex- 
pected fraction of particles with a caterpillar topology of 
each type is 1/18 (because there are 18 distinct paths in 
Fig. 2), whereas the expected fraction of particles with a 
balanced topology of each type is 2/18 = 1/9. However, 
since we have assumed there are no observations, the 
posterior should be equal to the prior, a uniform distri- 
bution. Therefore, the naive proposal leads to a biased 
approximate posterior. 

The strict height increase condition incorporated in 
our proposal addresses this issue. The dashed lines in 
Figure 2 show which naive proposals are forbidden by 
the height increase condition. After this modification, 
the bias disappears: 

PROPOSITION 5 Proposals over ultrametric forests that 
merge one pair of trees while strictly increasing the 
height of the forest satisfy Assumption 2. 

Proof. It is enough to show that each s € S covers at most 
one s' € S. If s = _L, this holds trivially. If s =/ _L, there 
is a unique s' covered by s, given by splitting the unique 
tallest tree in the forest (removing the edges connected 
to the root). □ 

The proposals used in Teh et al. (2008) all fall in this 
category. For example, for the "PriorPrior" proposal "V s , 
a pair of trees in the forest s is sampled uniformly at 
random, and the height increment of the forest is sam- 
pled from an exponential with rate ('j), the prior wait- 
ing time between two coalescent events. 

Again, many other options are available. For exam- 
ple, even when the prior is the coalescent model, one 
may want use a proposal with fatter tails to take into 
account deviation from the prior brought by the likeli- 
hood model. One way to achieve this is the "PostPost" 
proposal discussed by Teh et al. (2008), where local pos- 
teriors are used for both the height increment and the 
choice of trees to coalesce (see Appendix 2 for further 



discussion of this proposal). That approach has some 
drawbacks, however; it is complex to implement and 
is only applicable to likelihoods obtained from Brown- 
ian motion. Simpler heavy-tailed proposal distributions 
may be useful. 

Proposals can also be informed by heuristics H as dis- 
cussed in the next section. This can be done, for example, 
by giving higher proposal density to pairs of trees that 
form a subtree in H(s). 

In the nonclock case, we let S be the set of rooted non- 
clock forests over X. A nonclock forest, s = {(£,, X,)}, is a 
set of nonclock X, -trees f, such that the disjoint union of 
the leaves consists in the set of observed taxa, UX, = X. 
Defining the diameter of a rooted forest as twice the max- 
imum distance between a leaf and a root over all trees in 
the forest, we get that any proposal that merges a pair of 
trees and strictly increases the forest diameter is a valid 
proposal. The unique state covered by s =/ _L is the one 
obtained by splitting the tree with the largest diameter. 

Extensions 

We turn to the specification of extensions, y», of the 
density y from T to S. There is a simple recipe for ex- 
tensions that works for both nonclock and ultrametric 
trees: Given a posterior distribution model ny, set the 
extension over a forest s = {(f,, X,)} to be equal to 

Y.(s)= II Yy(*)(*i)- 
(t„x,)es 

We call this extension the natural forest extension. This 
definition satisfies Assumption 3 by construction. 

More sophisticated possibilities exist, with different 
computational trade-offs. For example, it is possible to 
connect the trees in the forest on the fly, by using a fast 
heuristic such as neighbor joining (Saitou and Nei 1987). 
If we let H : S — > T denote this heuristic, we then get this 
alternate extension: 
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FIGURE 3. Comparison of the convergence time of PosetSMC and MCMC. We generated coalescent trees of different sizes and data sets 
of 1000 nucleotides. We computed the LI distance of the minimum Bayes risk reconstruction to the true generating tree as a function of the 
running time (in units of the number of peeling recursions, on a log scale). The missing MCMC data points are due to MrBayes stalling on these 
executions. 



Y.(s)=Y(H(s)). 

As long as all the trees in s appear as subtrees of H(s), 
this definition also satisfies Assumption 3. This ap- 
proach can also be less greedy, by taking into account 
the effect of the future merging operations. We present 
other examples of extensions in Appendix 2. 

EXPERIMENTS 

In this section, we present experiments on real and 
synthetic data. We first show that for a given error 
level, SMC takes significantly less time to converge to 
this error level than MCMC. For the range of tree sizes 
considered (5-40), the speed gap of SMC over MCMC 
was around two orders of magnitude. Moreover, this 
gap widens as the size of the trees increases. We then 
explore the impact of different likelihoods, tree priors, 
and proposal distributions. Finally, we consider experi- 
ments with real data, where we observe similar gains in 
efficiency as with the simulated data. 

Computational Efficiency 

We compared our method with MCMC, the standard 
approach to approximating posterior distributions in 
Bayesian phylogenetic inference (see Huelsenbeck et al. 
2001 for a review). We implemented the PosetSMC al- 
gorithm in Java and used MrBayes (Huelsenbeck and 
Ronquist 2001) as the baseline MCMC implementation. 
A caveat in these comparisons is that our results depend 
on the specific choice of proposals that we made. 

In the experiments described in this section, we gen- 
erated 40 trees from the coalescent distribution of sizes 
{5,10,20,40} (10 trees of each size). For each tree, we 
then generated a data set of 1000 nucleotides per leaf 
from the Kimura two-parameter model (K2P) (Kimura 
1980) using the Doob-Gillespie algorithm (Doob 1945). 
In this section, both the PosetSMC and MCMC algo- 
rithms were run with the generating K2P model and 
coalescent prior, fixing the parameters. We use the Pri- 
orPrior proposal as described in Teh et al. (2008). Pri- 
orPrior chooses the trees to merge and the diameter of 
the new state from the prior; that is, the trees are chosen 



uniformly over all pairs of trees, whereas the new diam- 
eter is obtain by adding an appropriate exponentially 
distributed increment to the old diameter. We consider 
bigger trees as well as other proposals and models in 
the next sections. 

For each data set, we ran MCMC chains with increas- 
ing numbers of iterations from the set {10 3 , 10 4 , 10 5 , 10 6 }. 
We also ran PosetSMC algorithms with increasing num- 
bers of particles from the set {10 1 , 10 2 , 10 3 , 10 4 }. Each 
experiment was repeated 10 times, for a total of 3200 
executions. 

We computed consensus trees from the samples and 
measured the distance of this reconstruction to the gen- 
erating tree (using the metrics defined in the Back- 
ground and Notation section). The results are shown in 
Figure 3 for the LI metric. For each algorithm setting 
and tree size, we show the median distance across 100 
executions, as well as the first and third quartiles. A 
speedup of over two orders of magnitudes can be seen 
consistently across these experiments. 

In both the PosetSMC and MCMC algorithms, the 
computational bottleneck is the peeling recurrence 
(Felsenstein 1981), which needs to be computed at each 
speciation event in order to evaluate y(t). Each call re- 
quires time proportional to the number of sites times 
the square of the number of characters (this can be ac- 
celerated by parallelization, but parallelization can be 
implemented in both MCMC and PosetSMC samplers 
and thus does not impact our comparison). We therefore 
report running times as the number of times the peeling 
recurrence is calculated. As a sanity check, we also did a 
controlled experiment on real data in a single user pure 
Java setting, showing similar gains in wall clock time. 
(These results are presented in the Experiments on Real 
Data section). 

In Figures 4 and 5, we show results derived from the 
series of experiments described above for other met- 
rics. Note, in particular, the result in Figure 4; we see 
that for a fixed computational budget, the gap between 
PosetSMC and MCMC increases dramatically as the size 
of the tree increases. 
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FIGURE 4. LI distances of the minimum Bayes risk reconstruction 
to the true generating tree (averaged over trees and executions) as a 
function of the tree size (number of leaves on a log scale), measured 
for SMC algorithms run with 1000 particles and MCMC algorithms 
run for 1000 iterations. 



Varying Proposals, Priors, and Likelihoods 

In this section, we explore the effect of changing pro- 
posals, priors, and likelihood models. We also present 
results measured by wall clock times. 

We first consider data generated from coalescent trees. 
The proposal distribution is used to choose the trees in 
a partial state that are merged to create a new tree (in 
the successor partial state) as well as the diameter of 
the new state. In Figure 6, we compare two types of 
proposal distributions, PriorPrior, described in the pre- 
vious section, and PriorPost (Teh et al. 2008). PriorPost 
chooses the diameter of the state from the prior and 
then chooses the pair of trees to merge according to a 
multinomial distribution with parameters given by the 
likelihoods of the corresponding new states. (We pro- 
vide further discussion of this proposal and PriorPrior 
in Appendix 2.) Although these proposals were investi- 
gated experimentally in Teh et al. (2008), their running 
times were not compared in a satisfactory manner in 
that work. In particular, the running time was estimated 
by the number of particles. This is problematic since for 
a given binary X-tree, PriorPost uses 0(|X| 3 ) peeling 
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FIGURE 5. Analysis of the same data as in Figure 3, for 20 leaves, 
but with different metrics: L2 and Partition metrics, respectively. 
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FIGURE 6. Comparison of two types of SMC proposal distributions. 



recurrences per particle, whereas PriorPrior uses only 
0(|X|) recurrences per particle. Since we measure run- 
ning time by the number of peeling recurrences, our 
methodology does not have this problem. Surprisingly, 
as shown in Figure 6, PriorPrior outperforms the more 
complicated PriorPost by one order of magnitude. We 
believe that this is because PriorPost uses a larger frac- 
tion of its computational budget for the recent specia- 
tion events compared with PriorPrior, whereas the more 
ancient speciation events may require more particles to 
better approximate the uncertainty at that level, (e.g., 
suppose, we have a budget of 0(|X| 3 ) peeling recur- 
rences. A PriorPrior sampler can use 0(|X| 2 ) particles 
and leverages 0(|X| 2 ) peeling recurrence for proposing 
the top branch lengths. A PriorPost sampler can use 
only O(l) particles and therefore uses only O(l) peeling 
recurrences for proposing the top branch lengths.) 

Figure 7 shows the results for data generated by Yule 
processes (Rannala and Yang 1996) and uniform-branch- 
length trees. We see that PosetSMC is superior to MCMC 
in these settings as well. 
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FIGURE 7. Experiments with trees generated from different mod- 
els. We consider data generated by Yule processes and uniform- 
branch-length trees. We compare the LI distance of the minimum 
Bayes reconstruction with the true generating tree for PosetSMC and 
MCMC. 
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FIGURE 8. Experiments on synthetic gene frequencies using a 
Brownian motion likelihood model. We show results for two tree sizes. 
In each case, we plot the partition metric as a function of the wall time 
in milliseconds, shown on a log scale 

Next, we did experiments using a different type of 
data: synthetic gene frequency. We used Brownian mo- 
tion to generate frequencies, based the likelihood func- 
tion on the same model as before and a coalescent prior. 
As a baseline, we used an MCMC sampler based on 
standard proposals (Lakner et al. 2008) (stochastic near- 
est neighbor interchange, global and local multiplica- 
tive branch rescaling), with four tempering chains (Neal 
1996). Our implementations of continuous-character 
MCMC and SMC use the same code for computing the 
likelihood. We verified this code by comparing likeli- 
hood values against CONTML (Felsenstein 1973). Since 
computing the peeling dynamic program is the com- 
putational bottleneck for both SMC and MCMC, it is 
meaningful to compare wall clock times. 

The results are shown in Figure 8. In both exper- 
iments, there is again a wide computational regime 
where SMC outperforms MCMC. 

Experiments on Real Data 

We also tested our algorithm on a comparative RNA 
data set (Cannone et al. 2002) containing hand-aligned 
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ribosomal RNA. We sampled 16S components from the 
three domains of life at random to create multiple data 
sets. 

We use the log likelihood of the consensus tree to 
evaluate the reconstructions. Since finding states of 
high density is necessary but not sufficient for good 
posterior approximation, this provides only a partial 
picture of how well the samplers performed. Since the 
true tree is not known, however, this gives a sensible 
surrogate. 

We show the results in Figure 9. As in the syn- 
thetic data experiments, we found that the PosetSMC 
sampler required around two orders of magnitude 
less time to converge to a good approximation of the 
posterior. Moreover, the advantage of PosetSMC over 
MCMC becomes more pronounced as the number of 
taxa increases. For large numbers of iterations and 
particles, the MCMC sampler slightly outperformed 
the PosetSMC sampler on the real data we used. 

We also performed experiments on frequency data 
from the Human Genome Diversity Panel. In these ex- 
periments, we subsampled 11,511 Single Nucleotide 
Polymorphisms to reduce site correlations, and we used 
the likelihood model based on Brownian motion de- 
scribed in the previous section. We show the results 
in Figure 10, using the log likelihood of the consensus 
tree as an evaluation surrogate. This shows once again 
the advantages of SMC methods. To give a qualitative 
idea of what the likelihood gains mean, we show in 
Figure 11 the consensus tree from 10,000 MCMC itera- 
tions versus the consensus tree from 10,000 PosetSMC 
particles (the circled data points). Since both runs are 
under sampled, the higher-order groupings are incor- 
rect in both trees, but we can see that more mid- and 
low-order ethnic/geographic groupings are already 
captured by SMC. Incidentally, the position of the cir- 
cled data points show that in practice, K SMC particles 
are cheaper to compute than K MCMC iterations. This 
is because fewer memory writes are necessary in the 
former case. 
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FIGURE 9. Results on ribosomal RNA data (Cannone et al. 2002) on different tree sizes, comparing the log likelihood of the minimum Bayes 
risk reconstruction from SMC and MCMC approximations, as a function of the running time (in units of the number of peeling recursions on a 
log scale). 
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FIGURE 10. Results on human gene frequency data (Li et al. 2008), 
comparing the log likelihood of the minimum Bayes risk reconstruc- 
tion from SMC and MCMC approximations, as a function of the run- 
ning time (in milliseconds, on a log scale). 

Discussion 

We have presented a general class of SMC methods 
that provide an alternative to MCMC for Bayesian 
inference in phylogeny. By making use of the order- 
theoretic notion of a poset, our PosetSMC methods are 
able to exploit the recursive structure of phylogenetic 
trees in computing Bayesian posteriors and marginal 
likelihoods. Experimental results are quite promising, 
showing that PosetSMC can yield significant speedups 
over MCMC; moreover, the relative improvement over 
MCMC appears to increases as the number of taxa in- 
creases, even for simple proposal distributions. 

Although we have used simple likelihoods in our 
experiments, it is also possible to incorporate state-of- 
the-art models in our framework. Whenever the like- 
lihood for a given tree can be computed exactly and 
efficiently in a bottom-up manner, our framework ap- 
plies directly. This includes, for example, models based 
on more sophisticated rate matrices such as the general- 
ized time-reversible model (Tavare 1986), covarion mod- 
els (Penny et al. 2001), approximate context-dependent 
models (Siepel and Haussler 2004), and fixed-alignment 
insertion-deletion-aware models (Rivas 2005). 

It is also possible to accommodate models that re- 
quire alternating sampling of parameters and topol- 
ogy, such as random rate matrix and relaxed-clock 
models (Thorne et al. 1998; Huelsenbeck et al. 2000; 
Drummond et al. 2006) and insertion-deletion mod- 
els with alignment resampling (Redelings and Suchard 
2005). This can be done by making use of the Particle 
Markov chain Monte Carlo (PMCMC) method intro- 
duced in Andrieu et al. (2010). PMCMC is a hybrid of 
MCMC and SMC in which an SMC algorithm is used 
as a proposal in an MCMC chain. Remarkably, it is pos- 
sible to compute the acceptance ratio for this complex 



proposal exactly, directly from the output of SMC algo- 
rithm. Each factor in the acceptance ratio is computed 
from the data likelihood estimator of Equation (3): If we 
denote the estimate of Equation (3) for the current and 
previous MCMC iteration by L' and L, respectively, then 
the acceptance ratio is simply min {1, L'/L}. 

As alluded to earlier, an important advantage of 
SMC over MCMC is the ease in which it can adapt to 
parallel architectures. Indeed, the peeling recurrence 
when computing the proposals is the bottleneck in 
PosetSMC, and this computation can be easily paral- 
lelized by distributing particles across cores. Distri- 
bution across clusters is also possible. Note that by 
exchanging particles across machines after comput- 
ing the resampling step, it is possible to avoid moving 
around many particles. In contrast, parallelization of 
MCMC phylogenetic samplers is possible (Feng et al. 
2003; Altekar et al. 2004; Keane et al. 2005; Suchard 
and Rambaut 2009) but is not as direct. Moreover, all 
three main approaches to MCMC parallelization have 
limitations: The first method is to parallelize the like- 
lihood computation, including Graphics Processing 
Unit parallelization; when scale comes from a large 
number of taxa, however, this approach reaches its 
limits. The second method involves running several 
independent chains; however, this approach suffers 
from a lack of communication across processors: If one 
processor finds a way of getting out of a local opti- 
mum, it has no way of sharing this information with 
the other nodes. The third method is to augment the 
MCMC chain using auxiliary variables, for example, by 
using parallel tempering (Swendsen and Wang 1986), 
or multiple-try Metropolis-Hastings (Liu et al. 2000). 
However, these state augmentations induce trade- 
offs between computational efficiency and statistical 
efficiency. 

A practical question in the application of PosetSMC is 
the choice of K, the number of particles. Although this 
choice depends on the specific test functions that are 
being estimated, one generic measure of the accuracy of 
the PosetSMC estimator can be based on the effective 
sample size (ESS) (Kong et al. 1994) — the number of 
independent samples from the true posterior required 
to obtain an estimator with the same variance as the 
PosetSMC estimator. A small ESS (relative to K) is in- 
dicative that the SMC is inefficient. Such inefficiencies 
might be alleviated by increasing K or by changing the 
proposals. One method for estimating ESS in the context 
of classical SMC (Carpenter et al. 1999) can be applied 
here. The idea is to run PosetSMC on the same data L 
times for a fixed K. For scalar functions cf), the ratio of 
the variance of tfj estimated within each run to the vari- 
ance estimated across the L runs is an estimate of ESS. 

PosetSMC algorithms can also benefit from improved 
sampling techniques that have been developed for other 
SMC algorithms. For example, standard techniques 
such as stratified sampling (Kitagawa 1996) or adaptive 
resampling (Moral et al. 2011) can be readily applied. 
Other potential improvements are reviewed in Cappe 
et al. (2005) and Doucet and Johansen (2009). 
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Effect of under-sampling on the HGDP dataset 
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FIGURE 11. In this figure, we give a qualitative interpretation for the difference in log likelihood in Figure 10 for the consensus tree obtained 
from SMC with 10,000 particles and from MCMC with 10,000 iterations. Since both runs are under sampled, the higher-order groupings are 
incorrect in both trees, but we can see that more mid- and low-order ethnic/ geographic groupings are already captured by SMC. 
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Note finally that the poset framework applies be- 
yond phylogenetic tree reconstruction. Other examples 
of Bayesian inference problems in computational biol- 
ogy that may benefit from the PosetSMC framework 
include statistical alignment for proteins and DNA and 
grammatical inference for RNA. 
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Appendix 1 

Proof of Consistency 

In this section, we prove Proposition 4. The proof is 
similar conceptually to proofs of consistency for non- 
poset frameworks (Crisan and Doucet 2002; Moral 2004; 
Douc and Moulines 2008), with the main points of inter- 
est being the specific ways in which Assumptions 1, 2, 
and 3 are invoked to permit the generalization to posets. 
For simplicity we present the argument for convergence 
in L 2 here (using — > to denote convergence in L 2 ), assum- 
ing bounded y*; see Moral (2004) for many extensions, 
including strong laws and central limit theorems. 

Proof. We start by introducing a measure \ r ,K for the un- 
normalized distribution over the particles proposed at 
an SMC iteration r with K particles. From Assumption 1 
and 2(b), we see that A has the following form, for r > 0 
and all A e Ts- 

\ rtK (A) = Ti r _ ltK (-v.(A)). 

Note that in this section, we view n r j and hence 7\ r ^ 
as random variables, so we will need a notation for the 
limit as K — > oo of the ~\ r ,K measures, which, as we will 
see shortly has the form: 

\(A)=n r - 1 (-v.(A)), 

where n r is defined with the following Radon-Nikodym 
derivative relative to a dominating base measure \x: 



dn r 
d\i 



(s)=y*(s)l[p(s) =r]. 



The main step of the proof is to show by induction on 
r that: 



1 

K 



K->c 



«r-l 



and 



ftr,K{4>*) > 7t r (4 > *)> 



(7) 
(8) 



for any integrable cb^ : S — > K, at a rate of C/n. This is 
sufficient since Equation (4) can be obtained by taking 



the product over Equation (7) for r E {1, ... ,R}, and 
Equation (5), by specializing Equation (8) to r = R with 
4>*(s) = cb(s) for s e T and cj)*(s) = 0 otherwise, and by 
using Assumption 3(b). 

The base case is a standard importance sampling ar- 
gument, so we concentrate on the induction step. 

Note first that by Assumptions 1 and 2(a, b), the 
weight w r i is a deterministic function of s r y. Since the 
poset is acyclic and connected, there exists a unique 
s r— l,* covered by s r ^, denoted by pa(s ri ;t) which gives 
all the ingredients needed to compute Equation (2). 
Moreover, by the Radon-Nikodym theorem and 
Assumptions 1, 2(a), and 3(a), this function can be writ- 
ten as the product of two derivatives as follows: 



w Ti k = w(s) =t»( s ) 



1 



Y*(pa(s)) </(pa(s) -> s) 

dn r d[i 
d\± dA r 
dn r 
= dV 

We will let the number of particles at generation r and 
r' < r go to infinity separately, using the notation n r ^',K,- 
By applying the law of large numbers, we obtain 
(using the fact that the weights are bounded and 
uncorrelated): 



K r 



\ n r,K',K r 



dn r 
dY r 



Next, by applying the induction hypothesis to Equation 
(8) with r — 1 and cj) = y., we get that 



\K>(A) = iir-i,K>(-v.(A)) *-=3° ftr-i(y.(A)) = X r (A). 
Therefore, by setting K = K' = K r , we get: 



K limJlK,HA,.(^) (L 2 ) 

IK-ill r \dX r 

Ik-ll 



Fr-1 



using ||A r || = ||7t r _ i|| since the proposals "V. are assumed 
to be normalized. There is also a limit interchange argu- 
ment for this step that can be formalized using the rates 
of convergence and the Minkowski inequality. 

To prove the other induction component, Equa- 
tion (8), we first use Equation (7) to rewrite the weight 
normalization and then proceed similarly to the above 
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argument: 



K 
K 



tt,-,k(4>*) 



|7tr,x(4>*) K^oo ||7tr-l|| t /, d7t r 



1 x L d7t '' 



1 



7T r (ch«). 



7tr(4>*) 



□ 



Appendix 2 
Weight Updates 

In this section, we simplify Equation (2) in the case of 
PriorPrior and PriorPost proposals on coalescent trees, 
showing that the updates of Teh et al. (2008) are recov- 
ered in these special cases. 

In both cases, given a partial state s, a successor partial 
state (forest), s', is obtained from a previous partial state 
s by merging two trees, f; and t r , creating a new tree t m 
(see Fig. Al). Formally, we have that q(s — > s') > 0 im- 
plies that there are disjoint sets X ; = X;(s'),X r = X r (s') 
such that: 

s' = (s\{(f ; ,X),(f r ,X r )})U{(f„„X m )}, 

where X m — X m (s' ) = X; U X r , and t m = t m (s' ) is an X m -tree 
in which both l\ = f;(s') and t r = t r (s') occur as subtrees. 
As mentioned in the main text, the new forest is also 



required to be strictly taller. To formalize this, consider 
an ordering of the speciation events in a forest s, where 
the heights of these events are indexed in the order they 
were created in the path of partial states leading to s, 0 = 
ho,h\,..., /ip( s ) . Letting 5/(s) = hi — h{-\ denote the height 
increments, we require that for all i, 6,(s) > 0 with prob- 
ability one. We denote the tree t m by m s (f/, t r , 6), where 



6(s') 



>p(s') 



's'). The set of possible successor trees 



is denoted by S(s), and the subset of successors with top 
increment 5 is denoted by Sg (s) = {s' e S(s) : 6(s') = 6}. 

PriorPrior and PriorPost also share the same exten- 
sion, y», which is a product of three factors: one for the 
topology prior, one for the branch length prior, and one 
for the likelihood model Ly(t). The first prior factor is 
uniform over forest topologies, and so, this first factor 
can be ignored since y* is only needed up to a normal- 
ization constant. The second prior factor is written as 

a distribution over height increments, nj=i A;(6,(s)), 
where A; (6) is an exponential density for the coalescent 

prior: A,(6) =Expd(6 ; ( |x| ~' +1 ))- Finally, multiplying the 
likelihood by the prior yields: 

T*(s) oc 

where we use the symbol oc to mean that the expression 
is defined up to a proportionality constant that can only 
depend on s via |s|. 

We now consider the cancellations possible in Equa- 
tion (2) when the PriorPrior proposal, 

V- P r(s -> s') cx l[s' g S(s)]A p(s0 (6(s')), 



s 

o 




•A 
>B 






>C 












>D 








>E 











] 




h A ho h 



s 4 s 3 s 2 s l 



m 



FIGURE Al . A partial state s is extended to a new partial partial state s' by merging trees t; and t, to form a tree t m with height h$ > I13 . In the 
PriorPrior proposal, f; and tr are chosen uniformly from the three possible pairs, whereas the height increment 64 is chosen from an exponential 
distribution. In the PriorPost proposal, 64 is chosen from the exponential prior and, given 64, the pair to merge is chosen from a multinomial 
with parameters proportional to the likelihood of the tree t m . 
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is used: 

W pr -pr(s -> S') 



Y*(s') 



Y*(s)V-pi( s -> s ') 



A p(s / ) (5(s / ))Ly (Xtii(s , )) (t m (s')) 
L y(X,( s '))(W)) ^y(j;(^))(tr(s')) <7 P r-pr( s ~> s ') 



L y(X m ( S '))(^m(s / )) 
L ^(X,(s')) ( f '( S ')) L y(X,( S ')) (fr(s')) ' 

Note that cancellations such as these are not necessary 
to implementing PosetSMC; indeed, they do not play a 
significant computational role. Our goal in presenting 
them was simply to make precise the connection with 
Tehet al. (2008). 

In the PriorPost proposal, a height increment is first 
sampled from A p ( s /), then, conditioning on this incre- 
ment 5, the likelihood ratios 

R s ,S,(s') =l[s' e S 6 (s)] Ztf pr -pr(s -> s') 

are computed for all s' G S§ (s). These ratios are then nor- 
malized and become the parameters of a multinomial 
proposal over the pair of trees to coalesce: 



V"P°( S S ') = ^pr-pr( S s ') R s,b(s')(s')- 



We get the following simplified form of Equation (2) 
for PriorPost: 



I'pr- 



-p 0 (S -> S') : 



A p(s , ) (S(s'))L y(Xtii(s , )) (f m (s')) 

L y(X,( S '))(W)) Ly (Xr (s'))(tr(s')) (?pr-po(s ~> S ; ) 



UK. 



S ,5(s')l 



Finally we show how these weight updates can alter- 
natively (but less transparently) be expressed as the nor- 
malization of a specific version of the peeling recursion. 
Let y; denote observations for site / and let f^t denote 
the internal (hidden) nucleotide random variable at site 
j at the root of f . The recursion in question is: 



(i, j:tm =z\yj(X m )) 
c 



n e 

de{l,r} z' 



ntj, tl = i =z)n^,u =z , \y,(x d )) 

Therefore, the weight up- 



where C = , 7m rp 

dates can be obtained as functions of the product of the 
normalizations of the top-level recursions. 



